ABSTRACT


Educational systems typically contain a large pool of items 
(questions, problems). Using data mining techniques we can 
group these items into knowledge components, detect du- 
plicated items and outliers, and identify missing items. To 
these ends, it is useful to analyze item similarities, which can 
be used as input to clustering or visualization techniques. 
We describe and evaluate different measures of item similar- 
ity that are based only on learners’ performance data, which 
makes them widely applicable. We provide evaluation using 
both simulated data and real data from several educational 
systems. The results show that Pearson correlation is a suit- 
able similarity measure and that response times are useful 
for improving stability of similarity measures when the scope 
of available data is small. 


1. INTRODUCTION 


Interactive educational systems offer learners items (prob- 
lems, questions) for solving. Realistic educational systems 
typically contain a large number of such items. This is par- 
ticularly true for adaptive systems, which try to present suit- 
able items for different kinds of learners. The management 
of a large pool of items is difficult. However, educational 
systems collect data about learners’ performance and the 
data can be used to get insight into item properties. In this 
work we focus on methods for computing item similarities 
based on learners’ performance data, which consists of bi- 
nary information about the answers (correct /incorrect). 


Automatically detected item similarities are the first and 
necessary step in further analysis such as clustering of the 
items, which is useful in several ways, with one particular 
application being learner modeling [9]. Learner models es- 
timate knowledge and skills of learners and are the basis 
of adaptive behavior of educational systems. A learner
model requires a mapping of items into knowledge
components [17]. Item clusters can serve as a basis for
knowledge component definition or refinement. The specified
knowledge components are relevant not only for modeling;
they are also typically directly visible to learners in the
user interface of a system, e.g., in the form of an open
learner model
visualizing the estimated knowledge state, or in a personal- 
ized overview of mistakes, which is grouped by knowledge 
components. 


Information about items is also very useful for management 
of the content of educational systems — preparation of new 
items, filtering of unsuitable items, and preparation of
explanations and hint messages. Information about item
similarities and clusters can also be relevant for teachers,
as it can provide them with inspiration for “live”
discussions in class. This type of application is in line
with Baker’s argument [1]
for focusing on the use of learning analytics for “leveraging 
human intelligence” instead of its use for automatic intelli- 
gent methods. 


Item similarities and clusters are studied not only in ed- 
ucational data mining but also in a closely related area of 
recommender systems. The setting of recommender systems 
is in many aspects very similar to educational systems — in 
both cases we have users and items, just instead of “perfor- 
mance” (the correctness of answers, the speed of answers) 
recommender systems consider “ratings” (how much a user 
likes an item). Item similarities and clustering techniques 
have thus been also considered in the recommender systems 
research (we mention specific techniques below). There is a 
slight, but important difference between the two areas. In 
recommender systems item similarities and clusterings are 
typically only auxiliary techniques hidden within a
“recommendation black box”. In educational systems, it is
useful to make these results explicitly available to system
developers, curriculum production teams, or teachers.


There are two basic approaches to dealing with item similar- 
ities and knowledge components: a “model based approach” 
and an “item similarity approach”. The basic idea of the 
model based approach is to construct a simplified model that 
explains the observed data. Based on a matrix of learners’ 
answers to items we construct a model that predicts these 
answers. Typically, the model assigns several latent skills to 
learners and uses a mapping of items to corresponding latent 
factors. Models of this kind can often be naturally expressed
using matrix multiplication, i.e., fitting a model leads to
matrix factorization. Once we fit the model to data, items
that have the same value of a latent factor can be denoted as
“similar”. This approach leads naturally to multiple
knowledge components per item. The model is typically
computed using some optimization technique that leads only to
local optima (e.g., gradient descent). It is thus necessary
to address the role of initialization and parameter setting
of the search procedure. In recommender systems this approach is
used for implementation of collaborative filtering; it is often 
called “singular value decomposition” (SVD) [18]. In edu- 
cational context many variants of this approach have been 
proposed under different names and terminology, e.g., Q- 
matrix [3], non-negative matrix factorization techniques [8], 
sparse factor analysis [19], or matrix refinement [10]. 


With the item similarity approach we do not construct an 
explicit model of learners’ behavior, but we compute directly 
a similarity measure for each pair of items. These
similarities are then used to compute clusters of items, to
project items into a plane, or for other analysis (e.g., for
each item listing the 3 most similar items). This approach
naturally leads to a mapping with a single knowledge
component per item (i.e., a different kind of output from
most model based methods). One advantage of this approach is
easier interpretability. In recommender systems research this
approach is called neighborhood-based methods [11] or
item-item collaborative filtering [7]. Similarity has been
used for clustering of items [23, 24] and also for clustering
of users [29]. In the educational setting item similarity has
been analyzed using correlation of learners’ answers [22] and
problem solving times [21], and also using learners’ wrong
answers [25].


So far we have discussed methods that are based only on 
data about learners’ answers. Often we have some additional 
information about items and their similarities, e.g., a man- 
ual labeling or data based on syntactic similarity of items 
(text of questions). For both model based and item similar- 
ity approaches previous research has studied techniques for 
combination of these different types of inputs [10, 21]. 


In this work we focus on the item similarity approach, be- 
cause in the educational setting this approach is less ex- 
plored than the model based approach. We discuss specific 
techniques, clarify details of their usage, and provide evalua- 
tion using both data from real learners and simulated data. 
Simulated data are useful for evaluation of the considered 
unsupervised machine learning tasks, because in the case of 
real-world data we do not know the “ground truth”. 


The specific contributions of this work are the following. We 
provide guidelines for the choice of item similarity measures 
—we discuss different options and provide results identifying 
suitable measures (Pearson, Yule, Cohen); we also demon- 
strate the usefulness of “two step similarity measures”. We 
explore the benefits of using response time information as a
supplement to the usual information about the correctness of
answers.
We use and discuss several evaluation methods for the con- 
sidered tasks. We specifically consider the issue of “how 
much data do we need”. This is often practically more im- 
portant than the exact choice of a used technique, but the 
issue is rather neglected in previous work. 


2. MEASURES OF ITEM SIMILARITY 


Figure 1 provides a high-level illustration of the item
similarity approach. This approach consists of two steps that
are to a large degree independent. First, we compute an item
similarity matrix, i.e., for each pair of items i, j we
compute the similarity $s_{ij}$ of these items. Second, we
construct clusters or visualizations of items using only the
item similarity matrix.


Figure 1: High-level illustration of the general approach
to item analysis based on item similarities.


Experience with clustering algorithms suggests that the ap- 
propriate choice of similarity measure is more important 
than choice of clustering algorithm [13]. The choice of simi- 
larity measure is domain specific and it is typically not ex- 
plored in general research on clustering. Therefore, we focus 
on the first step — the choice of similarity measure — and ex- 
plore it for the case of educational data. 


2.1 Basic Setting 

In this work we focus on computing item similarities using 
learners’ performance data. As Figure 1 shows, the simi- 
larity computation can also utilize information from domain 
experts or automatically determined information based on 
the inner structure of items (e.g., text of questions or some 
available meta-data). 


We discuss different possibilities for computation of item 
similarities. Note that in our discussion we consistently use
“similarity measures” (higher values correspond to higher
similarity), whereas some related works provide formulas for
dissimilarity measures (distance of items; lower values
correspond to higher similarity). This is just a technical
issue, as we can easily transform similarity into
dissimilarity by subtraction.


The input to item similarity computation is data about
learner performance, i.e., an $L \times I$ matrix, where $L$
is the number of learners and $I$ is the number of items. The
matrix values specify learners’ performance. The matrix is
typically very sparse (many missing values). The output of the
computation is an item similarity matrix, which specifies 
similarity for each pair of items. 


Note that in our discussion we mostly ignore the issue of
learning (change of learners’ skill as they progress through
items). When learning is relatively slow and items are pre- 
sented in a randomized order, learning is just a reasonably 
small source of noise and does not have a fundamental im- 
pact on the computation of item similarities. In cases where 
learning is fast or items are presented in a fixed order, it 
may be necessary to take learning explicitly into account. 


2.2 Correctness of Answers 
The basic type of information available in educational
systems is the correctness of learners’ answers. So we start
with similarity measures that utilize only this type of
information, i.e., dichotomous data (correct/incorrect) on
learners’ answers to items. The advantage of these measures
is that they are applicable in a wide variety of settings.


With dichotomous data we can summarize learners’ performance
on items i and j using an agreement matrix with just four
values (Table 1). Although we have just four values to
quantify the similarity of items i and j, previous research
has identified a large number of different measures for
dichotomous data and analyzed their relations [5, 12, 20].
For example Choi et al. [5] discuss 76 different measures, al- 
beit many of them are only slight variations on one theme. 
Similarity measures over dichotomous data are often used in 
biology (co-occurrence of species) [14]. A more directly rele- 
vant application is the use of similarity measures for recom- 
mendations [30]. Recommender systems typically use either 
Pearson correlation or cosine similarity for computation of 
item similarities [11], but they consider richer than binary 
data. 


Table 1: An agreement matrix for two items and definitions
of similarity measures based on the agreement matrix
($n = a + b + c + d$ is the total number of observations).


                        item i
                        incorrect   correct
item j   incorrect      a           b
         correct        c           d


Yule      $S_y = (ad - bc)/(ad + bc)$
Pearson   $S_p = (ad - bc)/\sqrt{(a+b)(a+c)(b+d)(c+d)}$
Cohen     $S_c = (P_o - P_e)/(1 - P_e)$
          $P_o = (a + d)/n$
          $P_e = ((a+b)(a+c) + (b+d)(c+d))/n^2$
Sokal     $S_s = (a + d)/(a + b + c + d)$
Jaccard   $S_j = a/(a + b + c)$
Ochiai    $S_o = a/\sqrt{(a+b)(a+c)}$


Table 1 provides definitions of 6 measures that we have cho- 
sen for our comparison. In accordance with previous re- 
search (e.g., [5, 14]) we call measures by names of researchers 
who proposed them. The choice of measures was done in 
such a way as to cover measures used in the most closely re- 
lated work and measures which achieved good results (even 
if the previous work was in other domains). We also tried 
to cover different types of measures. 
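
As an illustration, the six measures of Table 1 can be computed from the four agreement counts with a few lines of Python. This is a minimal sketch, not the implementation used in the experiments: the function names are hypothetical, the Pearson formula is the standard phi coefficient, and returning 0 for an undefined ratio (zero denominator) is an assumption made only for this example.

```python
import math


def agreement_counts(answers_i, answers_j):
    """Count a, b, c, d for two equally long 0/1 lists (1 = correct).

    Layout follows Table 1 (rows: item j, columns: item i):
    a = both incorrect, b = j incorrect and i correct,
    c = j correct and i incorrect, d = both correct.
    """
    a = b = c = d = 0
    for x_i, x_j in zip(answers_i, answers_j):
        if x_j == 0 and x_i == 0:
            a += 1
        elif x_j == 0 and x_i == 1:
            b += 1
        elif x_j == 1 and x_i == 0:
            c += 1
        else:
            d += 1
    return a, b, c, d


def _ratio(num, den):
    # Assumption for this sketch: an undefined ratio is reported as 0.
    return num / den if den else 0.0


def similarities(a, b, c, d):
    """Return the six similarity measures of Table 1 as a dict."""
    n = a + b + c + d
    p_o = _ratio(a + d, n)
    p_e = _ratio((a + b) * (a + c) + (b + d) * (c + d), n ** 2)
    return {
        "yule": _ratio(a * d - b * c, a * d + b * c),
        "pearson": _ratio(
            a * d - b * c,
            math.sqrt((a + b) * (a + c) * (b + d) * (c + d)),
        ),
        "cohen": _ratio(p_o - p_e, 1 - p_e),
        "sokal": _ratio(a + d, n),
        "jaccard": _ratio(a, a + b + c),
        "ochiai": _ratio(a, math.sqrt((a + b) * (a + c))),
    }
```

For example, `similarities(*agreement_counts(x, y))` gives all six values for two answer vectors `x` and `y` restricted to learners who answered both items.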


Pearson measure is the standard Pearson correlation coef- 
ficient evaluated over the dichotomous data. In the con- 
text of dichotomous data it is also called Phi coefficient or 
Matthews correlation coefficient. The Yule measure is a
similar measure, which achieved good results in previous
work [30]. The Cohen measure is typically used as a measure
of inter-rater agreement (it is more commonly called
“Cohen’s kappa”). In our setting it makes sense to consider
this measure when we view learners’ answers as “ratings” of
items. Relations between these three measures are discussed
in [32].


Ochiai coefficient is typically used in biology [14]. It is also 
equivalent to cosine similarity evaluated over dichotomous 
data; cosine similarity is often used in recommender sys- 
tems for computing item similarity, albeit typically over in- 
terval data [7]. Sokal measure is also called Sokal-Michener 
or “simple matching”. It is equivalent to accuracy measure 
used in information retrieval. Together with Jaccard mea- 
sure they are often used in biology, but they have also been 
used for clustering of educational data [12]. 


Note that some similarity measures are asymmetric with re- 
spect to 0 and 1 values. These measures are typically used 
in contexts where the interpretation of binary values is pres- 
ence/absence of a specific feature (or observation). In the 
educational context it is more natural to use measures which 
treat correct and incorrect answers symmetrically. Never- 
theless, for completeness we have included also some of the 
commonly used asymmetric measures (Ochiai and Jaccard). 
In these cases we focus on incorrect answers (value a as op- 
posed to d) as these are typically less frequent and thus bear 
more information. 


2.3 Other Data Sources 


The correctness of answers is the basic source of informa- 
tion about item similarities, but not the only one. We 
can also use other data. The second major type of per- 
formance data is response time (time taken to answer an 
item). The basic approach to utilization of response time 
is to combine it with the correctness of an answer. Given
the correctness value $c \in \{0, 1\}$, a response time
$t \in \mathbb{R}^+$, and the median of all response times
$\tau$, we combine them into a single score $r$. Examples of
such transformations are: a linear transformation for correct
answers only ($r = c \cdot \max(1 - t/(2\tau), 0)$);
exponential discounting used in MatMat [28]
($r = c \cdot \min(1, 0.9^{t/\tau - 1})$); and a linear
transformation inspired by the high speed, high stakes scoring
rule used in Math Garden [16]
($r = (2c - 1) \cdot \max(1 - t/(2\tau), 0)$). The first
approach was used in our experiment due to its simplicity
and the high influence of response time information.
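
The three transformations can be sketched in Python as follows. The function names are hypothetical, and the sketch reads the formulas as $r = c \cdot \max(1 - t/(2\tau), 0)$, $r = c \cdot \min(1, 0.9^{t/\tau - 1})$, and $r = (2c - 1) \cdot \max(1 - t/(2\tau), 0)$, where `tau` is the median response time; taking the median with `statistics.median` over all observed times is an assumption of this example.

```python
import statistics


def score_linear(c, t, tau):
    """Linear score for correct answers only; incorrect answers score 0."""
    return c * max(1 - t / (2 * tau), 0)


def score_exponential(c, t, tau):
    """Exponential discounting: full score up to the median time."""
    return c * min(1, 0.9 ** (t / tau - 1))


def score_signed(c, t, tau):
    """Signed linear score: fast wrong answers are penalized most."""
    return (2 * c - 1) * max(1 - t / (2 * tau), 0)


def median_time(times):
    """Median of all observed response times (the tau of the formulas)."""
    return statistics.median(times)
```

With a median of 10 seconds, a correct answer given in 5 seconds scores 0.75 under the first rule, while an incorrect answer given equally fast scores -0.75 under the third rule.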


The scores obtained in this way are real numbers. Given the 
scores it is natural to compute similarity of two items using 
Pearson correlation coefficient of scores (over learners who 
answered both items). It is also possible to utilize specific 
wrong answers for computation of item similarity [25]. 
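
A minimal sketch of this computation is given below. It assumes that the scores of each item are stored as a dictionary from a learner identifier to a real-valued score (a hypothetical data layout, chosen here because the data are sparse), and it returns 0 when the correlation is undefined, which is an assumption of the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long lists (0.0 if undefined)."""
    n = len(xs)
    if n < 2:
        return 0.0
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def score_similarity(scores_i, scores_j):
    """Similarity of two items from {learner: score} dictionaries.

    Only learners who answered both items contribute.
    """
    shared = sorted(set(scores_i) & set(scores_j))
    return pearson(
        [scores_i[l] for l in shared],
        [scores_j[l] for l in shared],
    )
```

Restricting the computation to shared learners means that the number of observations behind each similarity value differs from pair to pair, which is one reason why the amount of available data matters.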


It is also possible to combine performance based measures 
with other types of data. For example we may estimate 
item similarity based on analysis of the content of items 
(syntactical similarity of texts), or collect expert opinion 
(manual categorization of items into several groups). The 
advantage of the similarity approach (compared to the model
based approach) is that different similarity measures can
usually be combined in a straightforward way by using a
weighted average of different measures.


2.4 Second Level of Item Similarity 

The basic computation of item similarities computes the
similarity of items i and j using only data about these two
items. To improve a similarity measure, it is possible to
employ a “second level of item similarity” that is based on
the computed item similarity matrix and uses information on
all items. Examples of such a second step are Euclidean
distance or correlation: the similarity of items i and j is
given by the Euclidean distance or Pearson correlation of
rows i and j in the similarity matrix. Note that Euclidean
distance may be used implicitly when we use a standard
implementation of some clustering algorithms (e.g., k-means).
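
The second step can be sketched as follows, assuming the first-level similarities are given as a full square matrix (a list of rows). The function name and the `method` argument are hypothetical, and using complete rows including the diagonal entries is an assumption of this example. Note that the Euclidean variant yields a distance, so lower values mean higher similarity.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long lists (0.0 if undefined)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def second_level(sim, method="pearson"):
    """Compare rows of a similarity matrix to obtain a second-level matrix.

    method="pearson": correlation of rows i and j (a similarity).
    method="euclid":  Euclidean distance of rows i and j (a distance).
    """
    n = len(sim)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if method == "pearson":
                out[i][j] = pearson(sim[i], sim[j])
            else:
                out[i][j] = math.dist(sim[i], sim[j])
    return out
```

Because each entry of the result depends on whole rows, every second-level value uses information about all items rather than about a single pair.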


With the basic approach to item similarity, we consider 
items similar when performance of learners on these items is 
similar. With the second step of item similarity, we consider 
two items similar when they behave similarly with respect 
to other items. The main reason for using this second step 
is the reduction of noise in data by using more informa- 
tion. This may be useful particularly to deal with learning. 
Two very similar items may have rather low direct
similarity, because getting feedback on the first item can
strongly influence the performance on the second item.
However, we expect both items to have similar similarities
to other items.


A more technical reason for using the second step
(particularly the Euclidean distance) is to obtain a measure
that is a distance metric. The measures described above mostly
do not satisfy the triangle inequality and thus do not satisfy
the requirements of a distance metric; this property may be
important for some clustering algorithms.


3. EVALUATION 


In this work we focus on item similarity, but we keep the 
overall context depicted in Figure 1 in mind. The quality of 
a visualization is to a certain degree subjective and difficult 
to quantify, but the quality of clusters can be quantified and 
thus we can use it to compare similarity measures. From 
the large pool of existing clustering algorithms [15] we con- 
sider k-means, which is the most common implementation 
of centroid-based clustering, and hierarchical clustering. We
used the agglomerative or “bottom up” approach, where items
are successively merged into clusters using Ward’s method as
the linkage criterion.


3.1 Data 


We use data from real educational systems as well as sim- 
ulated learner data. Real-world data provide information 
about the realistic performance of techniques, but the eval- 
uation is complicated by the fact that we do not know the 
“ground truth” (the “correct” similarity or clusters of items). 
Simulated data provide a setting that is in many aspects 
simplified but allows easier evaluation thanks to the access 
to the ground truth. 


Table 2: Data used for analysis.


                            learners   items     answers
Czech 1 (adjectives)           1,134     108      62,613
Czech 2                        4,567     210     336,382
MatMat: numbers                6,434      60      67,753
MatMat: addition               3,580     135      20,337
Math Garden: addition         83,297      30     881,994
Math Garden: multiplic.       97,842      30   1,233,024


For generating simulated data we use a simple approach
with a minimal number of assumptions and ad hoc parameters.
Each item belongs to one of $k$ knowledge components. Each
knowledge component contains $n$ items. Each item has a
difficulty generated from the standard normal distribution,
$d_i \sim \mathcal{N}(0, 1)$. Skills of learners with respect
to individual knowledge components are independent. The skill
of a learner $l$ with respect to knowledge component $j$ is
generated from the standard normal distribution,
$\theta_{lj} \sim \mathcal{N}(0, 1)$. We assume no learning
(constant skills). Answers are generated as Bernoulli trials
with the probability of a correct answer given by the logistic
function of the difference of a relevant skill and an item
difficulty (a Rasch model):
$p = (1 + \exp(-(\theta_{lj} - d_i)))^{-1}$. This approach is
rather standard; for example, Piech et al. [26] use a very
similar procedure, and other works use closely related
procedures [4, 12]. In the experiment reported below the
basic setting is 100 learners and 5 knowledge components
with 20 items each.
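
The generating procedure can be sketched in a few lines of Python. This is an illustration only: the function name, the `seed` argument, and the assumption that every learner answers every item are choices made for the example, and the probability of a correct answer is taken to be the logistic function of skill minus difficulty.

```python
import math
import random


def simulate(n_learners=100, n_components=5, items_per_component=20, seed=0):
    """Generate a complete 0/1 answer matrix from a Rasch-style model.

    Returns (answers, component) where answers[l][i] is 1 for a correct
    answer and component[i] is the knowledge component of item i.
    """
    rng = random.Random(seed)
    n_items = n_components * items_per_component
    component = [i // items_per_component for i in range(n_items)]
    # Item difficulties and learner skills are standard normal.
    difficulty = [rng.gauss(0, 1) for _ in range(n_items)]
    skill = [
        [rng.gauss(0, 1) for _ in range(n_components)]
        for _ in range(n_learners)
    ]
    answers = []
    for l in range(n_learners):
        row = []
        for i in range(n_items):
            theta = skill[l][component[i]]
            p = 1 / (1 + math.exp(-(theta - difficulty[i])))
            row.append(1 if rng.random() < p else 0)
        answers.append(row)
    return answers, component
```

The returned `component` list plays the role of the ground truth against which detected clusters can be compared.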


To evaluate techniques on realistic educational data, we use 
data from three educational systems. Table 2 describes the 
size of the used data sets. 


Umime Cesky (umimecesky.cz) is a system for practice of 
Czech spelling and grammar. We use data only from one ex- 
ercise from the system — simple “fill-in-the-blank” questions 
with two options. We use only data on the correctness of 
answers (response time is available, but since it depends on 
the text of a particular item its utilization is difficult). We 
focus particularly on one subset of items: questions about 
the choice between i/y in suffixes of Czech adjectives. For 
this subset we have manually determined 7 groups of items 
corresponding to Czech grammar rules. 


MatMat (matmat.cz) is a system for practice of basic arith- 
metic (e.g., counting, addition, multiplication). For each 
item we know the underlying construct (e.g., “13” or “7 + 
8”) and also the specific form of questions (e.g., what type of 
visualization has been used). We use data on both correct- 
ness and response time. We selected the two largest subsets: 
multiplication and numbers (practice of number sense,
counting).


Math Garden is another system for practice of basic arith- 
metic [16]. This system is more widely used than MatMat, 
but we do not have direct access to the system and detailed 
data. For the analysis we reuse publicly available data from 
previous research [6]. The available data contain both cor- 
rectness of answers and response times, but they contain 
information only about 30 items without any identification 
of these items. 


3.2 Comparison of Similarity Measures 

To evaluate similarity measures we consider several types 
of analysis. With simulated data, we analyze the similarity 
measures with respect to the ground truth while for real- 
world data we evaluate correlations among similarity mea- 
sures. We also compare the quality of subsequent cluster- 
ings using adjusted Rand index (ARI) [27, 31], which mea- 
sures the agreement of two clusterings (with a correction for 
agreement due to chance). Typically, we use the adjusted
Rand index to compare the clustering with a ground truth
(available for simulated data) or with a manually provided
classification (available for the Czech 1 data set). It can
also be used to compare two detected clusterings (clusterings
based on two different algorithms or clusterings based on
two independent halves of data).


Figure 2: Differences between similarity values inside
knowledge components and between them. The simulated data
set with the basic setting was used.
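
For reference, the adjusted Rand index of two clusterings can be computed from pair counts as sketched below. The function name is hypothetical and the code follows the standard pair-counting definition of the index; returning 1 in the degenerate case where the index is undefined is an assumption of this example.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index of two clusterings given as label lists."""
    n = len(labels_a)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0
    # Pairs of items placed together in both clusterings.
    together_both = sum(
        comb(v, 2) for v in Counter(zip(labels_a, labels_b)).values()
    )
    # Pairs placed together in each clustering separately.
    together_a = sum(comb(v, 2) for v in Counter(labels_a).values())
    together_b = sum(comb(v, 2) for v in Counter(labels_b).values())
    expected = together_a * together_b / total_pairs
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # assumption: degenerate case treated as agreement
    return (together_both - expected) / (maximum - expected)
```

The index is 1 for identical clusterings regardless of how the clusters are named, and it is close to 0 for clusterings that agree only as much as expected by chance.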


As a first step in the evaluation of similarity measures, we 
consider experiments with simulated data where we can uti- 
lize the ground truth. In clustering we expect high within- 
cluster similarity values and low between-cluster similarity 
values. Figure 2 shows the distribution of the similarity
values for selected measures and suggests which measures
separate within-cluster and between-cluster values better and
therefore which measures will be more useful in clustering.
The results show that for the Jaccard and Sokal measures the
values overlap to a large degree, whereas the Pearson and Yule
measures provide better results. Adding the second step
(Pearson correlation in this example) to the similarity
measure separates within-cluster and between-cluster values
better. That suggests that extending similarities in this way
is not only a necessary step for some subsequent algorithms
such as k-means but also a useful technique with better
performance.


For data coming from real systems we do not know the
ground truth and thus we can only compare the similarity
measures to each other. To evaluate how similar two measures
are, we take all similarity values for all item pairs and
compute the correlation coefficient. Figure 3 shows results
for two data sets which are good representatives of the
overall results. The Pearson and Cohen measures are highly
correlated (> 0.98) across all data sets and have nearly the
same values (although not exactly the same). Larger
differences (but only up to 0.1) can typically be found when
one of the values in the agreement matrix is small, which
happens only for poorly correlated items with the resulting
similarity value around 0. The second pair of highly
correlated measures is Ochiai and Jaccard, which are both
asymmetric with respect to the agreement matrix. The
correlation between these two pairs of measures varies
depending on the data set and in some cases drops to as low
as 0.5. Because of the high correlation within these pairs
we further report results only for the Pearson and Jaccard
measures. The Yule measure is usually similar to the Pearson
measure (correlation usually around 0.9). The main difference
is that the Yule measure spreads values more evenly across
the interval [-1, 1]. Sokal is the most outlying measure,
with no correlation or small correlation (usually < 0.6)
with all other measures.


Figure 3: Correlations of similarity measures.


Figure 4 shows the effect of the second level of item
similarity on the Pearson measure (results for other measures
are analogous). Euclidean distance as the second level of
similarity brings larger differences (lower correlation) than
Pearson correlation. The correlations for large data sets
such as Math Garden are usually high (> 0.9) and conversely
the lowest correlations are found in results for small data
sets. This suggests that the second level of similarity is
more significant, and thus potentially more useful, where
only a limited amount of data is available.


Figure 4: Correlations of Pearson measure and Pearson with
different second levels.


Finally, we evaluate the quality of the similarity measures
according to the performance of the subsequent clustering.
From the two considered clustering methods we used
hierarchical clustering in this comparison because it
naturally works with a similarity measure and does not require
a metric space. The other method gives similar results with
the same conclusions. Table 3 and Figure 5 show the results.
Although the results depend on the specific data set and the
clustering algorithm used, there is a quite clear general
conclusion. The Pearson and Yule measures provide better
results than Jaccard and Sokal, i.e., for the considered task
the latter two measures are not suitable. Pearson is usually
slightly better than Yule, but the choice between them seems
not to be fundamental (which is not surprising given that
they are highly correlated). The results also show that the
“second step” is always useful. The results for simulated
data favor Euclidean distance over Pearson, but there are
almost no differences for real-world data.


Figure 5: The quality of clustering for different measures
used in the second step of item similarity. Top: Simulated
data with 5 correlated skills. Bottom: Czech grammar with 7
manually determined clusters.


3.3. Do We Have Enough Data? 


In machine learning the amount of available data often is 
more important than the choice of a specific algorithm [2]. 
Our results suggest that once we choose a suitable type of
similarity measure (e.g., Pearson, Cohen, or Yule), the
differences between these measures are not fundamental; the
more important issue becomes the size of available data.


Specifically, for a given data set we want to know whether
the data are sufficiently large so that the computed item
similarities are meaningful and stable. This issue can be
explored by analyzing confidence intervals for computed
similarity values. As a simple approach to the analysis of
similarity stability we propose the following: we split the
available data into two independent halves (in a learner
stratified manner), for each half we compute the item
similarities, and we compute the correlation of the resulting
item similarities.


Figure 6: Stability of similarity measure (Yule) for
real-world data sets. Each data set was sampled and split into
halves, and Pearson correlation was computed for the
similarity values. Numbers on the right side indicate
thousands of answers in data sets.


We can also perform this computation for artificially reduced
data sets; this shows how the stability of results increases
with the size of data. Figure 6 shows this kind of analysis
for our real-world data sets. We clearly see large
differences among individual data sets. The Math Garden data
set contains a large number of answers and only a few items;
the results show excellent stability, so in this case we
clearly have enough data to analyze item similarities. For
the Czech grammar data set we have a large number of answers,
but these are divided among a relatively large number of
items. The results show reasonably good stability: the data
are usable for analysis, but more data would clearly bring
improvement. For the MatMat data the stability is poor; to
draw solid conclusions about item similarities we need more
data.
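
The split-half stability check can be sketched as follows. The sketch makes several assumptions that are not taken from the experiments: the data are stored as a dictionary from learner to a dictionary of item answers, the split is a random partition of learners into two halves, the first-level measure is the Pearson correlation of answers, and an undefined correlation is reported as 0. All function names are hypothetical.

```python
import math
import random
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation of two equally long lists (0.0 if undefined)."""
    n = len(xs)
    if n < 2:
        return 0.0
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def pair_similarities(data, learners, items):
    """Similarity of every item pair, using only the given learners."""
    sims = []
    for i, j in combinations(items, 2):
        shared = [l for l in learners if i in data[l] and j in data[l]]
        sims.append(pearson(
            [data[l][i] for l in shared],
            [data[l][j] for l in shared],
        ))
    return sims


def split_half_stability(data, items, seed=0):
    """Correlation of item similarities computed on two learner halves."""
    learners = sorted(data)
    random.Random(seed).shuffle(learners)
    half = len(learners) // 2
    first = pair_similarities(data, learners[:half], items)
    second = pair_similarities(data, learners[half:], items)
    return pearson(first, second)
```

Repeating the call on random subsamples of learners of increasing size gives a stability curve of the kind described above for artificially reduced data sets.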


3.4 Response Time Utilization 

The incorporation of response time information into a
similarity measure can change the meaning of similarity.
Figure 7 gives such an example and shows a projection of
items from MatMat practicing number sense. Similar items
according to measures using only correctness of answers tend
to be items with the same graphical representation in the
system. On the other hand, similar items according to
measures that also use response time are usually items
practicing close numbers.


We also used this method on data sets from Math Garden,
which are much larger. In this case the use of response
times has only a small impact on the computed item
similarities (correlations between 0.9 and 0.95). However,
the use of response times influences how quickly the
computation converges, i.e., how much data we need. To explore
this we consider as the ground truth the average of the
similarity matrices computed with and without response times
for the whole data set. Then we used smaller samples of the
data set to compute item similarities and checked the
agreement with this ground truth. Figure 8 shows the
difference between the speed of convergence of the measure
with and without response time utilization. The results show
that the measure which uses additional information from
response time converges to the ground truth much faster. This
result suggests that the use of response time can improve
clustering or visualizations when only a small number of
answers is available.

Table 3: Comparison of similarity measures for one real-world
data set (with sampled students) and simulated data sets with
c knowledge components and l learners. The values are the
adjusted Rand index (with 0.95 confidence interval) for a
hierarchical clustering computed from the given similarity
measure. The top result for every data set is highlighted.

Measure            Czech 1 (c=7)  l=50, c=5    l=100, c=5   l=200, c=5   l=100, c=2   l=100, c=10

Pearson            0.32 ± 0.02    0.26 ± 0.04  0.48 ± 0.05  0.84 ± 0.05  0.77 ± 0.12  0.34 ± 0.04
Jaccard            0.31 ± 0.03    0.06 ± 0.03  0.15 ± 0.04  0.29 ± 0.08  0.32 ± 0.18  0.09 ± 0.02
Yule               0.31 ± 0.03    0.19 ± 0.04  0.43 ± 0.05  0.77 ± 0.07  0.60 ± 0.15  0.31 ± 0.03
Sokal              0.15 ± 0.06    0.11 ± 0.02  0.18 ± 0.03  0.25 ± 0.05  0.12 ± 0.11  0.14 ± 0.02
Pearson → Euclid   0.43 ± 0.01    0.45 ± 0.05  0.80 ± 0.06  0.98 ± 0.01  0.95 ± 0.03  0.67 ± 0.04
Yule → Euclid      0.32 ± 0.02    0.36 ± 0.05  0.65 ± 0.07  0.94 ± 0.04  0.89 ± 0.11  0.43 ± 0.03
Pearson → Pearson  0.41 ± 0.03    0.39 ± 0.05  0.73 ± 0.06  0.96 ± 0.02  0.92 ± 0.03  0.55 ± 0.04
Yule → Pearson     0.32 ± 0.03    0.38 ± 0.05  0.72 ± 0.06  0.97 ± 0.02  0.94 ± 0.04  0.55 ± 0.05
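
The kind of evaluation summarized in Table 3 can be sketched in Python as follows. This is an illustration under stated assumptions, not the experimental code: the logistic simulation model, its parameters, the choice of Ward linkage, and all function names are hypothetical, and the adjusted Rand index is computed directly from the contingency table of the two labelings. The clustering uses SciPy's `linkage` and `fcluster`.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def simulate_answers(n_learners, n_components, items_per_component, seed=0):
    """Binary answers from a simple logistic model with one skill per component."""
    rng = np.random.default_rng(seed)
    component = np.repeat(np.arange(n_components), items_per_component)
    skill = rng.normal(size=(n_learners, n_components))
    difficulty = rng.uniform(-1.0, 1.0, size=component.size)
    p_correct = 1.0 / (1.0 + np.exp(-(skill[:, component] - difficulty)))
    answers = (rng.random(p_correct.shape) < p_correct).astype(float)
    return answers, component


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index from the contingency table of two labelings."""
    _, a = np.unique(labels_a, return_inverse=True)
    _, b = np.unique(labels_b, return_inverse=True)
    table = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(table, (a, b), 1)

    def pairs(counts):
        # Number of unordered pairs within each group, summed over groups.
        return (counts * (counts - 1) / 2).sum()

    same_both = pairs(table)
    same_a, same_b = pairs(table.sum(axis=1)), pairs(table.sum(axis=0))
    expected = same_a * same_b / (len(a) * (len(a) - 1) / 2)
    maximum = (same_a + same_b) / 2
    return (same_both - expected) / (maximum - expected)


def cluster_items(answers, n_clusters):
    """Pearson similarity, then Ward clustering on rows of the similarity matrix."""
    sim = np.corrcoef(answers, rowvar=False)
    # linkage treats each row as a vector and uses Euclidean distances,
    # which corresponds to a "Pearson -> Euclid" two-step measure.
    tree = linkage(sim, method="ward")
    return fcluster(tree, t=n_clusters, criterion="maxclust")


answers, truth = simulate_answers(n_learners=2000, n_components=3,
                                  items_per_component=8)
ari = adjusted_rand_index(truth, cluster_items(answers, n_clusters=3))
```

The example uses far more learners than the simulated settings in Table 3 so that the clustering is stable; repeating the run with smaller `n_learners` and several seeds would give a spread of values from which a confidence interval can be estimated.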

Figure 7: Projection of items practicing number sense from
the MatMat system. Left: measure based only on correctness.
Right: measure using response time. Opacity corresponds to
the numerical value of the item and color corresponds to the
graphical representation of the task.


4. DISCUSSION 


Our focus is the automatic computation of item similarities
based on learners’ performance data. These similarities can
then be used in further analysis of item relations, such as
item clustering or visualization. This outlines a direction
for future work, in which methods that use the item
similarities should be studied in more detail. Compared to
alternative approaches that have been proposed for the task
(e.g., matrix factorization, neural networks), the item
similarity approach is rather straightforward, easy to
implement, and easy to combine with other sources of
information about items (text of items, expert opinion). For
these reasons the item similarity approach should be used at
least as a baseline in proposals for more complex methods
such as deep knowledge tracing [26].


Figure 8: The speed of convergence to the ground truth for
measures with and without response time on the Math Garden
addition data set.


The most difficult step in this approach is the choice of a
similarity measure. Once a specific choice is made, the
approach is easy to implement. Our results provide some
guidelines for this choice. The Pearson, Yule, and Cohen
measures lead to significantly better results than the
Ochiai, Sokal, and Jaccard measures. It is also beneficial
to use a second step of item similarity (e.g., the Euclidean
distance over vectors of item similarities). The exact
choice of details does not seem to make a fundamental
difference (e.g., Pearson versus Yule in the first step, the
Euclidean distance versus Pearson correlation in the second
step). The Pearson correlation coefficient is a good
“default choice”, since it provides quite robust results and
is applicable in several settings and steps. It also has the
pragmatic advantage of fast, readily available
implementations in nearly all computational environments,
whereas measures like Yule may require additional
implementation effort.
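
To indicate how much implementation effort a measure like Yule involves, the following Python sketch computes it for all item pairs from the 2x2 agreement counts, followed by a Euclidean second step. It assumes that the Yule measure is Yule's Q, (ad - bc) / (ad + bc), and that the answers form a dense binary learners x items matrix without missing values; the function names are hypothetical.

```python
import numpy as np


def yule_similarity(answers):
    """Yule's Q for every item pair of a binary learners x items matrix."""
    x = np.asarray(answers, dtype=float)
    y = 1.0 - x
    a = x.T @ x  # both items answered correctly
    b = x.T @ y  # row item correct, column item incorrect
    c = y.T @ x  # row item incorrect, column item correct
    d = y.T @ y  # both items answered incorrectly
    num = a * d - b * c
    den = a * d + b * c
    # Pairs with an undefined ratio (zero denominator) are set to 0.
    return np.divide(num, den, out=np.zeros_like(num), where=den != 0)


def second_step_euclid(sim):
    """Euclidean distance between the rows (similarity vectors) of a matrix."""
    diff = sim[:, None, :] - sim[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))


# Toy data: 5 learners x 2 items, so a=2, b=1, c=1, d=1 for the item pair.
answers = np.array([[1, 1],
                    [1, 0],
                    [0, 1],
                    [0, 0],
                    [1, 1]])
q = yule_similarity(answers)      # q[0, 1] = (2*1 - 1*1) / (2*1 + 1*1) = 1/3
dist = second_step_euclid(q)
```

A Pearson second step would instead apply `np.corrcoef` to the rows of the first-step matrix. Real data with missing answers would need pairwise counts restricted to learners who answered both items, which this sketch does not handle.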


The amount of available data is the critical factor for the
success of automatic analysis of item relations. A key
question for practical applications is thus: “Do we have
enough data to use automated techniques?” In this work we
used several specific methods to analyze this question, but
the issue requires more attention, not just for the item
similarity approach, but also for other methods proposed in
previous work. For example, previous work on deep knowledge
tracing [26], which studies closely related issues, states
only that deep neural networks require large data, without
quantifying what ‘large’ means. The necessary quantity of
data is, of course, connected to the quality of the data:
some data sources are noisier than others, e.g., answers
from voluntary practice contain more noise than answers from
high-stakes testing. An important direction for future work
is thus to compare model-based and item similarity
approaches while taking into account the issue of the amount
and quality of available data.